Weakly Supervised Visual Semantic Parsing
Scene Graph Generation (SGG) aims to extract entities, predicates and their
semantic structure from images, enabling deep understanding of visual content,
with many applications such as visual reasoning and image retrieval.
Nevertheless, existing SGG methods require millions of manually annotated
bounding boxes for training, and are computationally inefficient, as they
exhaustively process all pairs of object proposals to detect predicates. In
this paper, we address those two limitations by first proposing a generalized
formulation of SGG, namely Visual Semantic Parsing, which disentangles entity
and predicate recognition, and enables sub-quadratic performance. Then we
propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic,
attention-based, bipartite message passing framework that jointly infers graph
nodes and edges through an iterative process. Additionally, we propose the
first graph-based weakly supervised learning framework, based on a novel graph
alignment algorithm, which enables training without bounding box annotations.
Through extensive experiments, we show that VSPNet outperforms weakly
supervised baselines significantly and approaches fully supervised performance,
while being several times faster. We publicly release the source code of our
method.
Comment: To be presented at CVPR 2020 (oral paper).
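The attention-based bipartite message passing described above can be illustrated with a minimal sketch. All names, shapes, and update rules here are illustrative assumptions, not VSPNet's actual equations: predicate nodes attend over entity nodes and vice versa, and both sets of node states are refined jointly over a few iterations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bipartite_message_passing(entities, predicates, steps=3):
    """Toy variant of attention-based bipartite message passing:
    each predicate node gathers information from all entity nodes
    (and vice versa), weighted by dot-product attention."""
    E, P = entities.copy(), predicates.copy()
    for _ in range(steps):
        att_p = softmax(P @ E.T, axis=1)   # (n_pred, n_ent) attention weights
        P = 0.5 * P + 0.5 * (att_p @ E)    # predicates aggregate entity states
        att_e = softmax(E @ P.T, axis=1)   # (n_ent, n_pred) attention weights
        E = 0.5 * E + 0.5 * (att_e @ P)    # entities aggregate predicate states
    return E, P

rng = np.random.default_rng(0)
E, P = bipartite_message_passing(rng.normal(size=(5, 8)),
                                 rng.normal(size=(3, 8)))
```

Because every predicate node attends over entities directly, the number of predicate slots can be fixed independently of the number of entity pairs, which is the intuition behind the sub-quadratic formulation.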
Multi-Layer Local Graph Words for Object Recognition
In this paper, we propose a new multi-layer structural approach for
object-based image retrieval. Our work tackles the problem of the
structural organization of local features. The structural features we propose
are nested multi-layered local graphs built upon sets of SURF feature points
with Delaunay triangulation. A Bag-of-Visual-Words (BoVW) framework is applied
to these graphs, yielding a Bag-of-Graph-Words representation. The
multi-layer nature of the descriptors lies in scaling from trivial Delaunay
graphs (isolated feature points) by increasing the number of nodes layer by
layer, up to graphs with a maximal number of nodes. A separate visual
dictionary is built for each layer of graphs. Experiments conducted on the SIVAL and
Caltech-101 data sets reveal that the graph features at different layers
exhibit complementary performance on the same content and outperform the
baseline BoVW approach. Combining all layers yields a significant
improvement in object recognition performance compared to single-level
approaches.
Comment: International Conference on MultiMedia Modeling, Klagenfurt, Austria (2012).
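The layer-by-layer growth of local graphs can be sketched as follows. This is a simplification under stated assumptions: the paper builds graphs with Delaunay triangulation over SURF keypoints, whereas this dependency-free sketch substitutes nearest-neighbour node sets and describes each local graph by the mean of its member descriptors.

```python
import numpy as np

def multilayer_graph_descriptors(points, descs, max_nodes=3):
    """Layer 1 is the isolated keypoint; layer k adds the (k - 1)
    nearest neighbours. Each layer-k local graph is summarised by
    the mean of its member descriptors (illustrative stand-in for
    a real graph descriptor)."""
    # pairwise squared distances between keypoints
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)          # each row lists self first
    layers = []
    for k in range(1, max_nodes + 1):
        members = order[:, :k]              # node sets of the layer-k graphs
        layers.append(descs[members].mean(axis=1))
    return layers                           # one descriptor matrix per layer

rng = np.random.default_rng(1)
points = rng.random((6, 2))                 # toy keypoint locations
descs = rng.random((6, 16))                 # toy local descriptors
layers = multilayer_graph_descriptors(points, descs, max_nodes=3)
```

In the BoVW step, each layer's descriptor matrix would then be quantised against that layer's own visual dictionary, matching the per-layer dictionaries described in the abstract.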
Annotating Social Media Data From Vulnerable Populations: Evaluating Disagreement Between Domain Experts and Graduate Student Annotators
Researchers in computer science have spent considerable time developing methods to increase the accuracy and richness of annotations. However, there is a dearth of research examining the positionality of annotators, how they are trained, and what we can learn from disagreements between different groups of annotators. In this study, we use qualitative analysis as well as statistical and computational methods to compare annotations between Chicago-based domain experts and graduate students, who annotated a total of 1,851 tweets with images. These tweets are part of a larger corpus associated with the Chicago Gang Intervention Study, which aims to develop a computational system that detects aggression and loss among gang-involved youth in Chicago. We found evidence to support the study of disagreement between annotators and underscore the need for domain expertise when reviewing Twitter data from vulnerable populations. Implications for annotation and content moderation are discussed.
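One common way to quantify disagreement between two annotator groups (not necessarily the statistic used in this study) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The labels below are invented for illustration; the category names merely echo the aggression/loss coding scheme mentioned in the abstract.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two sets of labels
    for the same items: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # expected chance agreement from each group's marginal label frequencies
    expected = sum(ca[c] * cb[c] for c in ca) / n ** 2
    return (observed - expected) / (1 - expected)

# hypothetical labels from experts vs. graduate students on six tweets
experts  = ["loss", "aggression", "other", "other", "loss", "other"]
students = ["loss", "other",      "other", "other", "loss", "aggression"]
print(round(cohens_kappa(experts, students), 3))  # 0.455
```

Values near 1 indicate near-perfect agreement, values near 0 indicate agreement no better than chance; systematic gaps between expert/expert and expert/student kappa are one way to surface the role of domain expertise.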